3.1 - Goal - Learning a Single Economic Task
Our first foray into reinforcement learning will be a foundational "Hello, World!" task. The objective is to create the simplest possible environment where an agent can learn a single, vital economic behavior: producing worker units.
By isolating this one task, we can verify that our entire RL architecture—from the SC2GymEnv
wrapper to the training script—is functioning correctly before adding further complexity.
Problem Definition
We will train an agent whose sole purpose is to learn the correct policy for building SCVs from a Command Center.
- Success Criteria:
  - The agent must learn to check for sufficient resources (`minerals >= 50`), as sketched below.
  - The agent must learn to check for an available production facility (an idle Command Center).
  - The agent must learn to correlate the correct game state with the "build worker" action.
  - The episode must terminate successfully upon reaching a target of 20 workers.
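For concreteness, these checks can be expressed directly against the game state. The sketch below assumes a python-sc2 style bot object (with `minerals`, `townhalls`, and `workers` attributes); the function names are illustrative, not part of the final implementation.

```python
def can_build_scv(bot) -> bool:
    """True when the 'build worker' action is currently legal."""
    has_minerals = bot.minerals >= 50       # an SCV costs 50 minerals
    has_idle_cc = bool(bot.townhalls.idle)  # a Command Center with an empty queue
    return has_minerals and has_idle_cc

def episode_done(bot, target: int = 20) -> bool:
    """Terminate the episode once the worker target is reached."""
    return bot.workers.amount >= target
```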
Environment Design
To facilitate this learning, we will design a minimal environment with precisely defined observation, action, and reward structures.
1. Observation Space (The Agent's Input)
The state provided to the agent must contain only the information necessary to make the "build worker" decision. All values are normalized to a [0.0, 1.0]
range to improve neural network stability.
Index | Feature | Normalization Factor | Purpose |
---|---|---|---|
0 | `self.minerals` | 1000 | "Can I afford the action?" |
1 | `self.workers.amount` | 50 | "How close am I to the goal?" |
2 | `self.supply_left` | 20 | "Do I have the capacity for a new unit?" |
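As a sketch of how this observation might be assembled (assuming the same python-sc2 style attributes; `build_observation` is an illustrative name), each raw value is divided by its normalization factor and clipped so it stays in range even when the raw count exceeds the factor:

```python
import numpy as np

def build_observation(bot) -> np.ndarray:
    """Pack the three features from the table into a normalized vector."""
    obs = np.array(
        [
            bot.minerals / 1000.0,      # index 0: "Can I afford the action?"
            bot.workers.amount / 50.0,  # index 1: "How close am I to the goal?"
            bot.supply_left / 20.0,     # index 2: "Do I have the capacity?"
        ],
        dtype=np.float32,
    )
    # Clip so values outside the expected range (e.g. a mineral bank
    # above 1000) cannot push the observation outside [0.0, 1.0].
    return np.clip(obs, 0.0, 1.0)
```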
2. Action Space (The Agent's Output)
The agent has a discrete, binary choice on each decision step.
Action Value | Agent's Intent |
---|---|
0 | Do Nothing |
1 | Attempt to build an SCV |
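In Gymnasium terms (assuming the SC2GymEnv wrapper exposes the standard `spaces` API), the two tables above translate into the following space declarations; this is a sketch of the declarations, not the full wrapper:

```python
import numpy as np
from gymnasium import spaces

# Matches the observation table: three normalized features in [0.0, 1.0].
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

# Matches the action table: 0 = do nothing, 1 = attempt to build an SCV.
action_space = spaces.Discrete(2)
```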
3. Reward Function (The Learning Signal)
The reward function is engineered to provide clear and immediate feedback, guiding the agent toward the correct behavior.
Triggering Condition | Reward Value | Signal Type | Purpose |
---|---|---|---|
"Build" action is taken and is possible | +5 | Sparse | Strongly reinforce the correct decision. |
"Build" action is taken but is impossible | -5 | Sparse | Strongly penalize impossible actions (teaches game rules). |
Every decision step | `+ (worker_count * 0.1)` | Dense | Continuously encourage progress toward the goal. |
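Putting the three rows together, one plausible implementation of this reward (with illustrative argument names) looks like the following:

```python
def compute_reward(action: int, build_possible: bool, worker_count: int) -> float:
    """Combine the sparse and dense terms from the reward table."""
    # Dense term: paid every decision step, grows with the worker count.
    reward = worker_count * 0.1
    # Sparse terms: only triggered when the agent chooses the build action.
    if action == 1:
        reward += 5.0 if build_possible else -5.0
    return reward
```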
Key Learning Challenges for the Agent
- Credit Assignment: The agent must learn that the `+5` reward is directly caused by its "Build" action when resources are sufficient.
- Constraint Learning: The agent must learn to avoid the `-5` penalty by paying attention to the `minerals` and `supply_left` observations, effectively learning the game's constraints.
- Temporal Progression: The dense reward scaled by `worker_count` encourages the agent not merely to "do nothing," but to actively pursue the goal over time.
This tightly scoped problem provides a perfect, verifiable test case for our RL framework. In the next section, we will translate this design into code.